Simple Random Walk

Andrew Kerr

2023-04-06

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Two equally matched opponents are competing in a game in which changes in score occur often and in one point increments. (Imagine a basketball game in which every basket counts only one point.) We’ll use simulation to investigate the following questions.

Which is more likely: that one team leads for most of the game, or that the lead tends to change frequently over the course of the game?
When would you expect the largest lead (or deficit) to occur — near the beginning, the end, or in the middle of the game? (If the largest lead (or deficit) is attained at several points in the game, when you do expect it to first occur?)
When would you expect the last tie to occur — near the beginning, the end, or in the middle of the game?

Pause to think about these questions before proceeding.

Let Xn be the point difference (team A - team B) after the first scores; Xn > 0 if team A is in the lead, Xn < 0 if team B is in the lead, and Xn = 0if the score is tied. We can model Xn as a simple symmetric random walk on the integers starting from X0 = 0, like in the Harry and Tom example from the handout.

Consider the first 2n steps of the walk. We say 2n because that will be even. We are interested, in particular, in times at which the walk can be in state 0 (that is, the score is tied), which can only happen in an even number of steps.

We are interested in the following random variables, each of which has been scaled to take values between 0 and 1.

T/(2n), where T is the time, between 0 and 2n, at which the walk is last in state 0. For example, if n = 50 and the walk takes 100 steps, and it is last at 0 at step 84, then T/(2n) = 0.84 (the last tie occurs with 16% of the game remaining.)
L/(2n), the fraction of time the walk stays above 0, where L the number of time steps for which the walk is above 0; L/(2n) is the proportion of the game team A is in the lead.
M/(2n), where M is the first time, between 0 and 2n, at which the maximum value, over time 0 to 2n, of the walk is attained; M/(2n) is the first time (measured as a proportion of the full game) at which team A’s largest lead is attained.

Write your own code to conduct and run a simulation to approximate the distribution of each of T/(2n), L/(2n), and M/(2n) for n = 100. Summarize the results with appropriate plots and summary statistics, and describe the distributions. Consider the three questions at the start of this page; what do your simulation results suggest? Write a brief report summarizing your results and conclusions.

Optional: experiment with different values of n and discuss how the results change with n.

Random Walk Function

random_walk <- function(sim_count, n, T_plt, L_plt, M_plt){
  n_sims <- sim_count
  n <- n
  
  results <- matrix(NA, 
                    nrow = n_sims, 
                    ncol = 3, 
                    dimnames = list(c(), c("T", "L", "M")))
  sims <- data.frame(time = 0:(2*n))
  
  for(i in 1:n_sims){
    N <- 0:(2*n)
    X <- rbinom((2*n), 1, 0.5)
    X <- replace(X, X == 0, -1)
    X <- c(0, X)
    Xn <- cumsum(X)
    
    sim <- data.frame(N = N, X = X, Xn = Xn)
      
    T <- sim %>%
      filter(Xn == 0) %>%
      summarise(max(N))
  
    L <- sim %>%
      filter(Xn > 0) %>%
      count()
    
    M <- sim %>%
      slice_max(Xn, with_ties = FALSE) %>%
      select(N)
    
    sims[i+1] <- sim$Xn
    results[i,] <- c(as.double(T/(2*n)), as.double(L/(2*n)), as.double(M/(2*n)))
  }
  
  
  walks_10 <- sims[1:11] %>%
    pivot_longer(cols = !time, names_to = "walk", values_to = "score") %>%
    ggplot() +
    geom_line(aes(time, score, color = walk)) +
    theme_bw()
  
  plot(walks_10)
    
  if(T_plt) {
    hist(results[, 1], breaks = seq(0, 1, 0.05), 
         main = paste("Last Tie N =", 2*n), 
         xlab = "Proportion of Game Played Until Last Tie")
  }
  if(L_plt) {  
    hist(results[, 2], breaks = seq(0, 1, 0.05), 
         main = paste("Walk Lead N =", 2*n), 
         xlab = "Proportion of Game Team A Led")
  }
  if(M_plt) {
    hist(results[, 3], breaks = seq(0, 1, 0.05), 
         main = paste("First Time Max Lead N =", 2*n), 
         xlab = "Proportion of Game Team A Attained Max Lead")
  }
  
  summary(results)
}

Simulations

N = 100

random_walk(10000, 100, TRUE, TRUE, TRUE)

##        T               L                M         
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.150   1st Qu.:0.1200   1st Qu.:0.1250  
##  Median :0.500   Median :0.4550   Median :0.4600  
##  Mean   :0.503   Mean   :0.4742   Mean   :0.4782  
##  3rd Qu.:0.860   3rd Qu.:0.8250   3rd Qu.:0.8300  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000

Results

All three histograms have similar shapes/distributions; there are very large counts at the beginning and end of the game, with the middle of the games appearing roughly uniform. Specifically, a U-shaped distribution can be seen among all three graphs.
The mean of all three distributions is about 0.48, with similar spreads among the distributions as well (the 1st and 3rd Quantiles also line up).
While the distributions for L/(2n) and M/(2n) are almost exactly the same, the statistics for T/(2n) are slightly larger (e.g. means: .4973 v .4816 v .4806).

Conclusions

Which is more likely: that one team leads for most of the game, or that the lead tends to change frequently over the course of the game?
- One team leads for most of the game (see L/(2n) plot). Either team A kept the lead the entire game (proportion = 1), or team B kept the lead the entire game (proportion = 0).
When would you expect the largest lead (or deficit) to occur — near the beginning, the end, or in the middle of the game? (If the largest lead (or deficit) is attained at several points in the game, when you do expect it to first occur?)
- End of the game (see M/(2n)). Either team A attained there largest lead at the end of the game, or the at beginning of the game which represents team B obtaining and holding the lead from the start of the game.
When would you expect the last tie to occur — near the beginning, the end, or in the middle of the game?
- The beginning or the end of the game (see T/(2n)). Either the last tie was the starting score, 0-0, or the last tie was at the end of the game, 100-100.
These conclusion went against what I originally believed to be true. Before running these simulations, I had thought that the game would be mostly back and forth, with lots of ties and lead changing throughout. I did not expect that either, one team will get the lead and hold onto it the rest of the game or the back and forth ties would happen with similar chances.

N = 10 v. N = 100 v. N = 500

T/(2n)

for(i in c(10, 100, 500)){
  random_walk(10000, i, TRUE, FALSE, FALSE)
}

As n increases, the middle of the distribution takes on a more uniform shape. However, at all three values of n the overall shape of the distribution is the same: large frequencies at 0 and 1 with a U-shape of frequencies in between.

L/(2n)

for(i in c(10, 100, 500)){
  random_walk(10000, i, FALSE, TRUE, FALSE)
}

As n increases, the frequency of 1 begins to catch up to the frequency at 0. Unlike T/(2n), the U-shape of large values of N is not present at N = 10. This suggests that in these simulations, most of the game was a tie or team B held the lead.

M/(2n)

for(i in c(10, 100, 500)){
  random_walk(10000, i, FALSE, FALSE, TRUE)
}

As n increases, the frequency of 1 begins to catch up to the frequency at 0. As I noted in my results, L/(2n) and M/(2n) appear slightly more similar to each other than T/(2n) is. This is reflected in how low counts of n for L/(2n) and M/(2n) appear similar, with most proportions at 0, then 1, while the proportions for T/(2n) are roughly equal at 0 and 1 - even at low counts of n.